So far we have covered the following sections:
Given the newly acquired skills, your assignment is to perform an analysis of a dataset of tweets provided in the course. The analysis should include topic modelling and sentiment analysis. Use different splits of the data to perform your analysis, then compare, correlate, and visualize the results.
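For illustration only, a minimal topic-modelling sketch with scikit-learn is shown below; the sample texts, vocabulary settings, and number of topics are placeholders, and the actual models used in this notebook come from separate notebooks in the repository.

# Illustrative topic-modelling sketch (not the models used later in this notebook).
# `sample_texts` stands in for the course tweets; tune the vectorizer settings and
# the number of topics on your own splits of the data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sample_texts = [
    "the economy and jobs dominate this election",
    "terrible weather and flooding in the midwest today",
    "watching the debate about foreign policy tonight",
    "early voting lines are long but worth it",
]

vectorizer = CountVectorizer(stop_words='english')    # bag-of-words features
X = vectorizer.fit_transform(sample_texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                      # per-document topic weights

# Print the top words of each topic (use get_feature_names_out() on newer scikit-learn).
terms = vectorizer.get_feature_names()
for topic_idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[::-1][:5]]
    print("Topic %d: %s" % (topic_idx, ", ".join(top_terms)))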
The dataset can be found on this link in the tweets folder:
All of the tweets are geotagged, so it is natural to show the geographical distribution of your analysis on the map you designed in Week 1 of the course. When visualizing your results, keep in mind that you can change the size, color, location, or even boundaries of the map. You can also hide and show regions, depending on the point you are trying to make.
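As a sketch of that kind of styling (assuming a GeoDataFrame of US states with a STATE_ABBR column, like the one built later in this notebook; the file path and the value column are illustrative):

# Sketch: restyling and filtering a geopandas choropleth.
import geopandas as gpd
import matplotlib.pyplot as plt

states = gpd.read_file('data/US_shape/states.geojson')   # illustrative path
states['value'] = range(len(states))                       # placeholder values to colour by

# Change figure size and colormap.
states.plot(column='value', cmap='OrRd', figsize=(15, 8))

# Hide regions you do not need, e.g. keep only the contiguous states.
contiguous = states[~states['STATE_ABBR'].isin(['AK', 'HI'])]
contiguous.plot(column='value', cmap='OrRd', figsize=(15, 8))

plt.axis('off')
plt.show()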
Make sure to correlate the findings with:
This information can be obtained from the sources mentioned in the lectures.
As an additional resource for the sentiment analysis, the same folder contains a Java-based utility named SentiStrength. You can use it to confirm your results or as an additional case study.
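One way to drive it from Python is to shell out to the jar; the jar name, data folder, and command-line options below are assumptions based on the SentiStrength documentation, so adjust them to the copy shipped with the course.

# Sketch: calling the SentiStrength jar via subprocess (paths and options are assumptions).
import subprocess

jar_path = 'SentiStrength.jar'        # placeholder: path to the jar in the course folder
senti_data = 'SentiStrength_Data/'    # placeholder: folder containing the lexicon files

def sentistrength_score(text):
    """Return SentiStrength's raw output for a single piece of text."""
    # In 'text' mode SentiStrength expects '+' instead of spaces.
    cmd = ['java', '-jar', jar_path,
           'sentidata', senti_data,
           'text', text.replace(' ', '+')]
    return subprocess.check_output(cmd).strip()

print(sentistrength_score('I love this but hate the rain'))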
This notebook integrates pretrained models and results from the topic modelling and sentiment analysis (found in other notebooks of the same GitHub repository) and presents our final results.
If MongoDB is installed on your machine and a database named Twitter has been created, the tweets can be stored as documents in it using the following code:
Note: if MongoDB is not installed (or not running), the queries below will fail with a connection-refused error.
In [1]:
from pymongo import MongoClient
from pprint import pprint
client = MongoClient()
db = client.Twitter
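The cell above only opens the connection and selects the Twitter database. Actually loading the raw tweets into the tweets collection could look like the sketch below; the file name is a placeholder for wherever the downloaded JSON-lines tweet files live.

# Sketch: loading raw tweets (one JSON object per line) into the Twitter.tweets collection.
# Uses the `db` handle created in the cell above; 'tweets.json' is a placeholder file name.
import json

with open('tweets.json') as f:
    docs = [json.loads(line) for line in f if line.strip()]

if docs:
    db.tweets.insert_many(docs)  # pymongo bulk insert into the 'tweets' collection
print("inserted %d tweets" % len(docs))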
In [2]:
import time
import numpy as np
import pandas as pd
import geopandas as gpd
In [3]:
start_time = time.time()
#keep only English-language tweets geotagged within the US
filter_query = {
"$and":[ {"place.country_code":"US"}, { "lang": "en" } ]
}
#we are keeping only our fields of interest
columns_query = {
'text':1,
'entities.hashtags':1,
'entities.user_mentions':1,
'place.full_name':1,
'place.bounding_box':1
}
tweets = pd.DataFrame(list(db.tweets.find(
filter_query,
columns_query
)#.limit()
)
)
elapsed_time = time.time() - start_time
print elapsed_time
In [4]:
tweets.drop(['_id'],axis=1,inplace=True)
In [5]:
tweets.head()
Out[5]:
In [6]:
print len(tweets)
Extract the fields we need (links, mentions, hashtags) into their own columns
In [7]:
import re
# A function that extracts the hyperlinks from the tweet's content.
def extract_link(text):
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()
    return ''

# A function that checks whether a word is included in the tweet's content
def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False
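A quick sanity check of the two helpers on a made-up tweet (illustrative only):

sample = 'Long lines at the polls today https://example.com/vote #election'
print(extract_link(sample))             # https://example.com/vote
print(word_in_text('polls', sample))    # True
print(word_in_text('weather', sample))  # False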
In [8]:
tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))
tweets['text'] = tweets['text'].apply(lambda tweet: re.sub(r"http\S+", "", tweet))
In [9]:
# Functions to extract hashtags and mentions from the tweet entities
def extract_hashtags(ent):
    return [hasht['text'].lower() for hasht in ent['hashtags']]

def extract_mentions(ent):
    return [usr_ment['screen_name'].lower() for usr_ment in ent['user_mentions']]
In [10]:
tweets['hashtags'] = map(extract_hashtags,tweets['entities'])
tweets['mentions'] = map(extract_mentions,tweets['entities'])
tweets.drop(['entities'],axis=1,inplace=True)
In [11]:
tweets['state'] = map(lambda place_dict: place_dict['full_name'][-2:] ,tweets['place'])
tweets['geography'] = map(lambda place_dict: place_dict['bounding_box'] ,tweets['place'])
tweets.drop(['place'],axis=1,inplace=True)
Convert all text to lowercase
In [12]:
#make all text lowercase
tweets['text'] = tweets.text.apply(lambda x: x.lower())
In [14]:
tweets.head()
Out[14]:
In [15]:
import os,json
In [16]:
from data.US_states import states #import a useful dictionary containing US states and their abbreviations
In [17]:
import geopandas as gpd
import pandas as pd
import pickle
import matplotlib.pyplot as plt
In [18]:
S_DIR = '/home/antonis/ipython notebooks/UvA/Fundamentals of Data Science/utils/US_shape'
filename = os.path.join(S_DIR, 'states.geojson')
US_shape = gpd.read_file(filename) #opens the .geojson file as geopandas dataframe
#US.sortlevel
US_shape = US_shape[['STATE_ABBR','geometry']]
US_shape.columns = ['NAME','geometry']
US_shape.set_index("NAME",inplace=True)
US_shape.head()
Out[18]:
In [30]:
nr_tweets_perstate = pickle.load(open('results/nr_tweets_perstate.pickle', 'rb'))
In [ ]:
In [19]:
#dicts to map between state abbreviations and full names
state_abbr_to_name = states
state_name_to_abbr = {v: k for k, v in state_abbr_to_name.iteritems()}
state_name_to_abbr['United States'] = 'US'
In [20]:
#load US Census population data
path = '/home/antonis/ipython notebooks/UvA/Fundamentals of Data Science/Week1/sc-est2016-agesex-civ.csv'
population_data_all = pd.read_csv(path)[['NAME','SEX','AGE','POPEST2016_CIV']]
population_data_all.columns = ['NAME','SEX','AGE','POPULATION2016']
population_data_all['NAME'] = population_data_all['NAME'].apply(lambda x: state_name_to_abbr[x])
population_data_all.tail()
Out[20]:
In [21]:
state_data = population_data_all.groupby(by=['NAME'])[['POPULATION2016']].sum()
state_data.head()
Out[21]:
In [22]:
#get males and females by state
In [23]:
def population_by_gender(population_df, state, gender):
    """Return the 2016 population estimate for a state.
    Pass the US Census 2016 estimate dataframe as the first argument;
    gender is 1 for male, 2 for female, 0 for the total population."""
    try:
        state = state_name_to_abbr[state]
    except KeyError:
        pass  # already an abbreviation
    return population_df[(population_df.NAME == state) &
                         (population_df.AGE == 999) &
                         (population_df.SEX == gender)]['POPULATION2016'].iloc[0]
In [24]:
males = pd.Series(map(lambda x:population_by_gender(population_data_all,x,1),state_data.index),index=state_data.index,name='males')
females = pd.Series(map(lambda x:population_by_gender(population_data_all,x,2),state_data.index),index=state_data.index,name='females')
In [ ]:
In [25]:
#calculate avg age per state
In [26]:
def wavg(group, avg_name, weight_name):
    """Weighted average of `avg_name` using `weight_name` as weights.
    Falls back to the plain mean when the weights sum to zero.
    See http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
    and http://pbpython.com/weighted-average.html.
    """
    d = group[avg_name]
    w = group[weight_name]
    try:
        return float((d * w).sum()) / w.sum()
    except ZeroDivisionError:
        return d.mean()
In [27]:
avg_age = pd.Series(population_data_all[(population_data_all['SEX']==0) & (population_data_all['AGE']!=999)].groupby('NAME').apply(wavg, "AGE", "POPULATION2016"),name='avg_age')
In [ ]:
In [31]:
state_data = gpd.GeoDataFrame(pd.concat([state_data,males,females,nr_tweets_perstate,avg_age,US_shape],axis=1,join='inner'))
In [29]:
state_data.index.name="NAME"
US_shape.index.name="NAME"
In [ ]:
In [32]:
state_data.head()
Out[32]:
In [33]:
state_data['tweets_per_capita'] = state_data['nr_tweets']/state_data['POPULATION2016']
Scale and move states
In [34]:
%matplotlib inline
import geopandas as gpd
import warnings
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.colors import rgb2hex
from descartes import PolygonPatch
from shapely.geometry import Polygon, MultiPolygon
In [ ]:
In [35]:
# scale and move Alaska
a = state_data.loc[state_data.index == "AK"].index[0]
Al = state_data['geometry'][a]
Al_ser = gpd.GeoSeries(Al)  # convert to GeoSeries
Al_scale = Al_ser.scale(xfact=0.5, yfact=0.5, origin='center')  # shrink Alaska
Al_scale = Al_scale.translate(xoff=28, yoff=-39)  # move Alaska down and to the right
b = MultiPolygon(Al_scale.all())  # convert the GeoSeries back to a MultiPolygon
state_data['geometry'][a] = b
In [36]:
# scale and move Hawaii
a = state_data.loc[state_data.index == "HI"].index[0]
Hi = state_data['geometry'][a]
Hi_ser = gpd.GeoSeries(Hi)  # convert to GeoSeries
Hi_scale = Hi_ser.scale(xfact=1.8, yfact=1.8, origin='center')  # enlarge Hawaii
Hi_scale = Hi_scale.translate(xoff=53, yoff=0)  # move Hawaii to the right
b = MultiPolygon(Hi_scale.all())  # convert the GeoSeries back to a MultiPolygon
state_data['geometry'][a] = b
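The two cells above repeat the same scale-and-translate steps. A reusable helper along these lines (a sketch; it indexes the frame by state abbreviation and assigns the transformed geometry back directly) would do the same job:

# Sketch: factor the scale-and-move operation into one helper.
def reposition_state(gdf, abbr, factor, xoff, yoff):
    """Scale a state's geometry about its centre and shift it by (xoff, yoff)."""
    geom = gpd.GeoSeries(gdf.loc[abbr, 'geometry'])
    geom = geom.scale(xfact=factor, yfact=factor, origin='center')
    geom = geom.translate(xoff=xoff, yoff=yoff)
    gdf.loc[abbr, 'geometry'] = geom.iloc[0]
    return gdf

# equivalent to the two cells above:
# state_data = reposition_state(state_data, 'AK', 0.5, 28, -39)
# state_data = reposition_state(state_data, 'HI', 1.8, 53, 0)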
In [ ]:
In [37]:
state_data.plot(column='nr_tweets',scheme = 'fisher_jenks',legend=True, cmap='OrRd',figsize=(15,8))
plt.axis('off')
plt.title("Number of tweets (in our sample)")
plt.show()
In [38]:
state_data.plot(column='tweets_per_capita',scheme = 'fisher_jenks',legend=False, cmap='OrRd',figsize=(15,8))
plt.axis('off')
plt.title("Tweets per capita in each state")
plt.show()
2012 and 2016 Presidential Elections data provided by Kaggle.com: https://www.kaggle.com/benhamner/2016-us-election
In [39]:
demographics = pd.read_csv('data/2012 and 2016 Presidential Elections/county_facts.csv')
# RHI825214: % white alone (not Hispanic or Latino), 2014
# EDU635213: % high school graduate or higher (age 25+)
# EDU685213: % bachelor's degree or higher (age 25+)
demographics = demographics[['area_name','state_abbreviation','RHI825214','EDU635213','EDU685213']]
In [40]:
demographics.head()
Out[40]:
In [41]:
# keep only the state-level rows (county rows have state_abbreviation filled in)
demographics = demographics[(demographics['state_abbreviation'].isnull()) & (demographics['area_name']!='United States')]
try:
    demographics.drop('state_abbreviation', axis=1, inplace=True)
except KeyError:
    pass  # column already dropped on a re-run
In [ ]:
In [42]:
def change_to_abbr(name):
    if name == 'District Of Columbia':
        return 'DC'
    try:
        return state_name_to_abbr[name]
    except KeyError:
        return 'dropme'  # marker for names we cannot map; dropped later by the inner join
In [43]:
demographics['area_name'] = demographics.area_name.apply(lambda x: change_to_abbr(x))
In [44]:
demographics.set_index('area_name',inplace=True)
In [45]:
# heuristic education score per state:
# Education_score = 0.5 * (% high school only) + 1.0 * (% bachelor's or higher)
# e.g. 90% high-school-or-higher and 30% bachelor's-or-higher gives 0.5*(90-30) + 30 = 60
demographics['Education_score'] = (demographics['EDU635213']-demographics['EDU685213'])*0.5 + demographics['EDU685213']
In [46]:
demographics['non_white_popul'] = 100-demographics['RHI825214']
demographics = demographics[['non_white_popul','Education_score']]
demographics.index.name = 'NAME'
In [47]:
state_data = pd.concat([state_data,demographics],axis=1,join='inner')
In [48]:
state_data.head()
Out[48]:
In [49]:
from itertools import chain
hashtags = pd.Series(list(chain.from_iterable(tweets['hashtags'].values)))
hashtag_count = hashtags.value_counts()
In [51]:
hashtag_count.head()
Out[51]:
In [52]:
hashtag_count.head(10).plot(kind='bar')
plt.show()
In [53]:
top_hashtag = pd.Series(name='top_hashtag')
for state in state_abbr_to_name.keys():
    htstate = tweets[tweets['state'] == state]['hashtags']
    temp = pd.Series(list(chain.from_iterable(htstate.values)))
    if temp.empty:
        continue  # skip states with no hashtags in our sample
    top_hashtag.loc[state] = temp.value_counts().index[0]  # most frequent hashtag
In [54]:
state_data.columns
Out[54]:
In [ ]:
In [56]:
# state_data = state_data.drop(0,axis=1)
# #state_data.drop('top_hashtag',axis=1,inplace=True)
In [57]:
state_data = pd.concat([state_data,top_hashtag],axis=1,join='inner')
In [58]:
state_data.head()
Out[58]:
In [59]:
#export csv
#state_data.to_csv("state_data_demographics.csv")
In [ ]:
In [ ]:
In [69]:
#credits to https://stackoverflow.com/questions/38899190/geopandas-label-polygons
state_data['coords'] = state_data['geometry'].apply(lambda x: x.representative_point().coords[:])
state_data['coords'] = [coords[0] for coords in state_data['coords']]
state_data['state_area'] = state_data['geometry'].apply(lambda x: x.area)
state_data.plot(column='tweets_per_capita', scheme='fisher_jenks', legend=False, cmap='OrRd', figsize=(20,10))
for idx, row in state_data.iterrows():
    plt.annotate(s='#'+row['top_hashtag'], xy=row['coords'],
                 horizontalalignment='center', size=max(12, row.state_area/5))
plt.show()
In [ ]:
In [72]:
topic_distribution_ps = pickle.load(open('results/topic_distribution_per_state_1.pickle', 'rb'))
Define topics
In [73]:
from data.topics import topic_numbering
In [74]:
[topic_numbering[i] for i in range(0,10)]
Out[74]:
In [75]:
topic_distribution_ps.columns = [topic_numbering[i] for i in range(0,10)]
In [76]:
from data.US_states import regions
In [77]:
state_sentiment = pickle.load(open('results/state_sentiment_0.5.pickle', 'rb'))
In [78]:
state_sentiment.head()
Out[78]:
In [80]:
# normalize sentiment counts and create an overall "preference indicator"
# T+/T- and C+/C- are the positive/negative sentiment counts for the Trump- and Clinton-related
# tweets; 'overall' > 1 means relatively more positive sentiment towards Trump than towards Clinton
state_sentiment['T+%'] = state_sentiment['T+']/(state_sentiment['T+']+state_sentiment['T-'])
state_sentiment['C+%'] = state_sentiment['C+']/(state_sentiment['C+']+state_sentiment['C-'])
state_sentiment['overall'] = state_sentiment['T+%']/state_sentiment['C+%']
In [81]:
state_sentiment.describe()
Out[81]:
In [82]:
#investigate topic1 (racism)
df = topic_distribution_ps.merge(state_sentiment[['T+%','C+%','overall']],right_index=True,left_index=True)
In [83]:
df.drop(['?','Weather (?)'],axis=1,inplace=True)
In [84]:
df['region'] = map(lambda x: regions[x] ,df.index)
In [ ]:
In [85]:
df.head()
Out[85]:
In [ ]:
In [86]:
import seaborn as sns
Correlate topic: racism with politics
In [87]:
topic_distribution_ps_clean = topic_distribution_ps.drop(['?','Weather (?)'],axis=1)
In [88]:
topic_distribution_ps_clean.head()
Out[88]:
In [89]:
sns.heatmap(topic_distribution_ps_clean.corr(), cmap='jet')
plt.xticks(rotation=45)
plt.title("Topic correlation across states")
Out[89]:
In [90]:
df.to_csv('maarten/df_topics_and_sentiment_per_state.csv')
In [ ]:
In [91]:
import numpy as np
In [92]:
sns.lmplot('Discrimination','overall',df,fit_reg=False,hue='region')
plt.ylabel("Hillary -> Trump")
plt.show()
In [96]:
df.columns
Out[96]:
In [97]:
sns.lmplot('Discrimination','T+%',df,fit_reg=False,hue='region')
plt.show()
In [98]:
sns.lmplot('Discrimination','C+%',df,fit_reg=False,hue='region')
plt.show()
In [95]:
# Result: maybe Clinton supporters talk about / criticize racism a lot?
In [ ]:
Correlate topic: Business & financial with supporters
In [112]:
sns.lmplot('Business & financial','overall',df,fit_reg=False,hue='region',)
plt.ylabel("Hillary -> Trump")
plt.show()
In [ ]:
Correlate topic: Foreign affairs with supporters
In [115]:
sns.lmplot('Foreign affairs','overall',df,fit_reg=False,hue='region')
plt.ylabel("Hillary -> Trump")
plt.show()
In [ ]:
In [129]:
sns.lmplot('Domestic policy','overall',df,fit_reg=False,hue='region',)
plt.ylabel("Hillary -> Trump")
plt.show()
In [137]:
sns.lmplot('Elections (motivation & engagement)','overall',df,fit_reg=False,hue='region',)
plt.ylabel("Hillary -> Trump")
plt.show()
In [138]:
sns.lmplot('Elections (candidacy)','overall',df,fit_reg=False,hue='region',)
plt.ylabel("Hillary -> Trump")
plt.show()
In [139]:
sns.lmplot('Elections (neutral)','overall',df,fit_reg=False,hue='region',)
plt.ylabel("Hillary -> Trump")
plt.show()
In [ ]:
In [ ]:
Topics of discussion per state
In [99]:
# Draw a heatmap with the numeric values in each cell
f, ax = plt.subplots(figsize=(9, 10))
sns.heatmap(topic_distribution_ps, annot=False, ax=ax, cmap="YlGnBu")
plt.title("Topic discussion distribution per state")
plt.show()
In [100]:
topic_distribution_ps.head()
topic_distribution_ps['region'] = df['region'].copy()
In [101]:
topic_distribution_ps_clean.columns
Out[101]:
In [102]:
topic_elections = topic_distribution_ps_clean[['Discrimination', 'Elections (candidacy)', 'Lies & subjectivity',
                                               'Elections (motivation & engagement)', 'Elections (neutral)']]
In [103]:
topic_affairs = topic_distribution_ps[['Foreign affairs','Domestic policy','Business & financial']]
In [ ]:
In [156]:
sns.pairplot(topic_elections, kind='reg')
plt.show()
In [ ]:
In [143]:
sns.pairplot(topic_distribution_ps_clean, kind='reg')
Out[143]:
In [264]:
sns.pairplot(topic_affairs, kind='reg')
plt.show()